Abstract: We study the problem of efficient compression
of a stochastic source of probability distributions. It can be
viewed as a generalization of Shannon's source coding problem.
It is related to the theory of common randomness,
as well as to channel coding and rate-distortion theory: in
the first two subjects "inverses" to established coding theorems
can be derived, yielding a new approach to proving
converse theorems; in the third we find a new proof of Shannon's
rate-distortion theorem.

After reviewing the known lower bound for the optimal 
compression rate, we present a number of approaches to 
achieve it by code constructions. Our main results are: 
a better understanding of the known lower bounds on the 
compression rate by means of a strong version of this state- 
ment, a review of a construction achieving the lower bound 
by using common randomness which we complement by 
showing the optimal use of the latter within a class of proto- 
cols. Then we review another approach, not dependent on 
common randomness, to minimizing the compression rate, 
providing some insight into its combinatorial structure, and 
suggesting an algorithm to optimize it. 

The second part of the paper is concerned with the gener- 
alization of the problem to quantum information theory: the 
compression of mixed quantum states. Here, after review- 
ing the known lower bound we contribute a strong version 
of it, and discuss the relation of the problem to other issues 
in quantum information theory. 

I. Sources of distributions 

A theorem of Shannon [16] basic to all information theory
describes the optimum compression of a discrete memoryless
source, showing that the minimum achievable rate
is the entropy of the source distribution. The situation is
the following:

Let $P$ be a probability distribution on the finite set $\mathcal{X}$.
We call $(E,D)$ an $(n,\lambda)$-code for the discrete memoryless
source $P$, if

$$E : \mathcal{X}^n \to \mathcal{C}, \qquad D : \mathcal{C} \to \mathcal{X}^n \tag{1}$$

are stochastic maps, with a finite set $\mathcal{C}$, such that

$$\sum_{x^n \in \mathcal{X}^n} P^n(x^n) \Pr\{x^n = D(E(x^n))\} \geq 1 - \lambda, \tag{2}$$

where

$$\Pr\{x^n = D(E(x^n))\} = D(E(x^n))\{x^n\}.$$

Denoting by $M(n,\lambda)$ the minimal $|\mathcal{C}|$ such that an $(n,\lambda)$-code exists,
Shannon [16] shows that for $\lambda \in (0,1)$

$$\lim_{n\to\infty} \frac{1}{n} \log M(n,\lambda) = H(P),$$

with the entropy $H(P) = -\sum_x P(x) \log P(x)$ of the distribution.

Motivated by the work and by a construction in Q 
(in footnote 4) , we study here the following modification of 
this problem: 

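To each $x \in \mathcal{X}$ is associated a probability distribution
$W_x$ on the finite set $\mathcal{Y}$ (thus $W$ is a stochastic map, or
channel, from $\mathcal{X}$ to $\mathcal{Y}$). An $(n,\lambda)$-code is now a pair $(E,D)$
of stochastic maps

$$E : \mathcal{X}^n \to \mathcal{C}, \qquad D : \mathcal{C} \to \mathcal{Y}^n \tag{3}$$

(compare with eq. (1)), and instead of condition (2) we
impose

$$\sum_{x^n} P^n(x^n) \, \frac{1}{2} \bigl\| W^n_{x^n} - D(E(x^n)) \bigr\|_1 \leq \lambda, \tag{4}$$

where $\|\cdot\|_1$ is the $\ell^1$-norm on functions on $\mathcal{Y}^n$:
$\|f\|_1 = \sum_{y^n} |f(y^n)|$. Note that for two probability distributions $P$
and $Q$, $\frac{1}{2}\|P-Q\|_1$ equals their total variational distance
$d_{TV}(P,Q) = \sup_{A \subset \mathcal{Y}^n} |P(A) - Q(A)|$. We define
$M(n,\lambda)$ to be the minimal $|\mathcal{C}|$ of an $(n,\lambda)$-code.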

Note that for $\mathcal{Y} = \mathcal{X}$, and $W_x$ the point mass in $x$,
the new notion of $(n,\lambda)$-code coincides with the previous
one. Notice further that we allow probabilistic choices in
the encoding and decoding. While it is easy to see that this
freedom does not help in Shannon's problem, it is crucial
for the more general form that we will study in this paper.
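As a concrete reading of the fidelity criterion, the following Python sketch evaluates the average total variational distance between $W_x$ and $D(E(x))$ for block length $n = 1$. The source `P`, the channel `W` and the two candidate codes are hypothetical toy values chosen only for illustration; they do not come from the paper.

```python
def tv_distance(p, q):
    """Total variational distance (1/2) * ||p - q||_1 of two distributions (dicts)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in support)

def compose(E_x, D):
    """Output distribution D(E(x)): mix the decoder rows D[c] with weights E_x[c]."""
    out = {}
    for c, e in E_x.items():
        for y, d in D[c].items():
            out[y] = out.get(y, 0.0) + e * d
    return out

def average_error(P, W, E, D):
    """Average over x ~ P of the distance between W_x and D(E(x)), block length 1."""
    return sum(P[x] * tv_distance(W[x], compose(E[x], D)) for x in P)

# Hypothetical toy source: two letters, binary output alphabet.
P = {"a": 0.5, "b": 0.5}
W = {"a": {0: 0.9, 1: 0.1}, "b": {0: 0.2, 1: 0.8}}

# Code 1: transmit x itself and let the decoder sample from W_x (two codewords).
E_id = {"a": {"a": 1.0}, "b": {"b": 1.0}}
D_w = W

# Code 2: a single codeword; the decoder outputs the average distribution PW.
E_const = {"a": {"c": 1.0}, "b": {"c": 1.0}}
D_avg = {"c": {0: 0.55, 1: 0.45}}

err_exact = average_error(P, W, E_id, D_w)
err_const = average_error(P, W, E_const, D_avg)
```

The first code reproduces every $W_x$ exactly at the price of two codewords; the second uses a single codeword and incurs an average error of 0.35 on this toy source.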

The basic problem of course is to find the optimum rate

$$T_\lambda(P,W) = \lim_{n\to\infty} \frac{1}{n} \log M(n,\lambda)$$

of compression (if the limit exists; otherwise the lim sup is to be
considered), and especially the behaviour of this function
as $\lambda \to 0$.

For the case $\lambda = 0$, i.e. perfect restitution of the distributions
$W_x$, these definitions in principle make sense, but
we don't expect a neat theory to emerge. Instead we define

$$S(n) = \min H(E(P^{\otimes n})),$$

the minimal entropy of the distribution on $\mathcal{C}$ induced by the
encoder $E$ of an $(n,0)$-code (the idea being that blocks of these
$n$-blocks can then be data compressed to this rate). Obviously
$S(n_1+n_2) \leq S(n_1) + S(n_2)$, so the limit

$$T(P,W) = \lim_{n\to\infty} \frac{1}{n} S(n)$$

exists, and is equal to the infimum of the sequence. To
evaluate this quantity is another problem we would like to
solve.

The structure of this paper is as follows: first we find
lower bounds (section II), then discuss upper bounds, preferably
by constructing codes: in section III we show how
the lower bound is approached by using the additional
resource of common randomness; in section IV we prove
achievability of it under a letterwise fidelity criterion as a
consequence of this result; section V presents a construction
to upper bound $T$ and $T_\lambda$. In section VI applications
of the results and conjectures are presented: first, we make
it plausible that the distillation procedure of [?] is asymptotically
reversible, second we show that Shannon's coding
theorem allows an "inverse" (at least in situations where
unlimited common randomness is around), third we give
a simple proof that feedback does not increase the rate
of a discrete memoryless channel, and fourth we demonstrate
how Shannon's rate-distortion theorem follows as a corollary.
The compression result (with or without common
randomness) thus reveals a great unifying power in classical
information theory. Finally, in section VII we discuss
extensions of our results to the case of a source of mixed
quantum states: the present discussion fits into this model,
as probability distributions are just commuting mixed
state density operators.

Let us mention here the previous work on the problem:
the major initiating works are [?] and [?]. The latter introduced
the distinction between blind and visible coding,
and between the block- and letterwise fidelity criterion. In
contrast to the pure state case the four possible combinations
of these conditions seem to lead to rather different
answers. The case of blind coding with either the letter-
or blockwise fidelity criterion was solved recently by Koashi
and Imoto [12]. Otherwise in this paper, we will only address
the visible case. An attempt on the letterwise fidelity
case with either blind or visible encoding was made
in [13]. However, an examination of the approach of this
work shows that it does not fit into any of the classes
of fidelity criteria proposed by [?]: for a code $(E,D)$ one
could either apply the global criterion, which is essentially
our eq. (4); that is definitely not what is considered in [13],
where rate distortion theory is employed.

Or one could impose that the output $D(E(x^n))$ is good
on the average letterwise (the local criterion of [?]):

$$\sum_{x^n} P^n(x^n) \, \frac{1}{n} \sum_{k=1}^{n} d\bigl(W_{x_k}, D(E(x^n))_k\bigr) \leq \lambda, \tag{5}$$

where $D(E(x^n))_k$ denotes the marginal distribution of
$D(E(x^n))$ on the $k$-th factor in $\mathcal{Y}^n$, and $d$ is any distance
measure on probability distributions (that we require
only to be convex in the second variable). For
$d(P,Q) = \frac{1}{2}\|P-Q\|_1$ this is implied by eq. (4). This,
too, is not met in [13], as there $E$ and $D$ are constructed
as deterministic maps, while to satisfy eq. (5) one needs at
least a small amount of randomness.



To achieve this one could base the fidelity condition on
looking at individual letter positions of source and output
simultaneously:

$$\frac{1}{n} \sum_{k=1}^{n} \sum_{x} P(x) \, d\Bigl(W_x, \sum_{x^n : x_k = x} \frac{P^n(x^n)}{P(x)} D(E(x^n))_k\Bigr) \leq \lambda. \tag{6}$$



Condition (5) being weaker than (4), this one is still weaker.
However, this, too, does not coincide with the criterion
of [13]: denoting by $G$ the joint distribution of $x$ and $y$
according to $P$ and $W$, i.e. $G(xy) = P(x)W_x(y)$, one considers

$$\sum_{x^n} P^n(x^n) \, d\Bigl(G, \frac{1}{n} \sum_{k=1}^{n} \delta_{x_k} \otimes D(E(x^n))_k\Bigr) \leq \lambda. \tag{7}$$

(This is implied by eq. (1) of [13] for $\epsilon = \delta = \lambda/2$, which
in turn is implied by eq. (7) for $\epsilon = \delta = \sqrt{\lambda}$.) It is not
at all clear how to connect this with any of the above:
eq. (7) is about the empirical joint distribution of letters
in $x^n$ and $D(E(x^n))$ (assume for simplicity, as indeed the
authors of [13] do, that $E$ and $D$ are deterministic), that is,
about a distribution created by selecting a position $k$ randomly,
while eqs. (4) to (6) are about distributions created
either by the coding process alone or in conjunction with
the source. Our view is confirmed in an independent recent
analysis of [13] by Soljanin to the same effect.

An interesting new twist was added when in |q] ( and later 
in a more extended way in ||^ and the recent |18| ) the use 
of unlimited common randomness between the sender and 
receiver was allowed in the visible coding model with block- 
wise fidelity criterion. As already mentioned, we reproduce 
this result here in detail, with special attention to the re- 
source of common randomness; we present a protocol for 
which we prove that it has minimum common randomness 
consumption in the class of protocols which even simulate 
full passive feedback of the received signal to the sender. 

II. Lower bound and conjectures 

Let the random variable $X^n = (X_1,\ldots,X_n)$ be distributed
according to $P^n$. Then we can define $Y^n$ by

$$\Pr\{Y^n = y^n | X^n = x^n\} = W^n_{x^n}(y^n).$$

By eq. (4) we have the Markov chain

$$X^n - E(X^n) - D(E(X^n)) \approx Y^n.$$

Using the data processing inequality we can estimate as follows:

$$\begin{aligned}
\log|\mathcal{C}| &\geq H(E(X^n)) \\
 &\geq I(X^n \wedge E(X^n)) \\
 &\geq I(X^n \wedge D(E(X^n))) \\
 &\geq I(X^n \wedge Y^n) - n f(\lambda),
\end{aligned}$$

with $f(\lambda) \to 0$ for $\lambda \to 0$. To be precise, one may choose
(for $\lambda < 1/2$)

$$f(\lambda) = \lambda(\log|\mathcal{X}| + 2\log|\mathcal{Y}|) + 2h(\lambda),$$

employing the following well known result with eq. (4):

Lemma 1: Let $P$ and $Q$ be probability distributions on
a set with finite cardinality $a$, such that $\|P-Q\|_1 \leq \lambda \leq 1/2$.
Then

$$|H(P) - H(Q)| \leq a\,h\!\left(\frac{\lambda}{a}\right) := -\lambda \log\frac{\lambda}{a}.$$

Proof. See e.g. [?]. ■

Thus we arrive at 

Theorem 2: For any $n$ and $0 < \lambda < 1$:

$$\frac{1}{n} \log M(n,\lambda) \geq I(P;W) - f(\lambda),$$

where

$$I(P;W) = H(PW) - \sum_x P(x) H(W_x)$$

is the mutual information of the channel $W$ between the
input distribution $P$ and the output distribution $PW = \sum_x P(x) W_x$.
By using slightly stronger estimates, we even get

Theorem 3: For every $\lambda \in (0,1)$

$$\liminf_{n\to\infty} \frac{1}{n} \log M(n,\lambda) \geq I(P;W).$$

Proof. Let $(E,D)$ be an optimal $(n,\lambda)$-code. From eq. (4)
we find (by a Markov inequality argument) that

$$P^n\Bigl(\Bigl\{x^n : \frac{1}{2}\bigl\|W^n_{x^n} - D(E(x^n))\bigr\|_1 \leq \sqrt{\lambda}\Bigr\}\Bigr) \geq 1 - \sqrt{\lambda}.$$

Denote the intersection of this set with the typical sequences
$T^n_{P,\delta}$ (see eq. (8) below) by $A$, with $\delta = \sqrt{2|\mathcal{X}|/(1-\sqrt{\lambda})}$.
Then

$$P^n(A) \geq \frac{1-\sqrt{\lambda}}{2} =: \lambda',$$

and there exists an $(n,\lambda')$-transmission code $\mathcal{U} \subset A$ for the
channel $W$ with $|\mathcal{U}| \geq \exp(nI(P;W) - O(\sqrt{n}))$, see [?]
(the case of a classical-quantum channel $W$ was done
in [?]). By construction this is an $(n, 1-\lambda')$-code for the
channel $D \circ E$, since $\lambda' + \sqrt{\lambda} = 1 - \lambda'$.

We now want to view $E$ as belonging to the message encoder,
and $D$ as belonging to the message decoder, the
resulting code being one for the identical channel on $\mathcal{C}$.
Let us denote the concatenation of the map $D$ with the
channel decoder by $\delta$. On the other hand, we may replace
$E$ by a deterministic map $e$, because randomization at the
encoder never decreases error probabilities: $(e,\delta)$ still is an
$(n, 1-\lambda')$-code. It is now obvious that $|e^{-1}(c)| \leq \lambda'^{-1}$ for
every $c \in \mathcal{C}$, hence

$$M(n,\lambda) = |\mathcal{C}| \geq \lambda'|\mathcal{U}| = \exp\bigl(nI(P;W) - O(\sqrt{n})\bigr),$$

and we are done. ■

It might be a bit daring to formulate conjectures at this
point, so we content ourselves with posing the following
questions:

Question 4: Is it true that for all $\lambda \in (0,1)$

$$\lim_{n\to\infty} \frac{1}{n} \log M(n,\lambda) = I(P;W)\,?$$

In fact, we would like to present a slightly stronger statement:

Question 4': For every $\lambda \in (0,1)$, $\epsilon > 0$, $\delta > 0$, and large
enough $n$, does there exist an $(n,\lambda)$-code with

$$\frac{1}{n} \log|\mathcal{C}| \leq I(P;W) + \epsilon$$

and with the additional property that

$$\forall x^n \in T^n_{P,\delta}: \quad \frac{1}{2}\bigl\|W^n_{x^n} - D(E(x^n))\bigr\|_1 \leq \lambda\,?$$

Here $T^n_{P,\delta}$ is the set of typical sequences:

$$T^n_{P,\delta} = \bigl\{x^n : \forall x \ |N(x|x^n) - nP(x)| \leq \delta\sqrt{n}\,\sigma_x\bigr\}, \tag{8}$$

where $N(x|x^n)$ counts the number of occurrences of $x$ in $x^n$,
and $\sigma_x := \sqrt{P(x)(1-P(x))}$. Observe that by Chebyshev's
inequality

$$P^n(T^n_{P,\delta}) \geq 1 - \frac{|\mathcal{X}|}{\delta^2}. \tag{9}$$

In fact, by employing the Chernoff bound we even obtain

$$P^n(T^n_{P,\delta}) \geq 1 - |\mathcal{X}| \exp(-\delta^2). \tag{10}$$

With these bounds it is easily seen that a positive answer
to the latter question implies the same to the former. But
also conversely, it is not difficult to show that a "yes" to
question 4 implies a "yes" to question 4'.
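The typical set and the Chebyshev estimate can be checked by brute force for small block lengths. The Python sketch below enumerates all sequences of length $n$, tests the typicality condition letter by letter, and compares the total probability of the typical set with the bound $1 - |\mathcal{X}|/\delta^2$. The distribution `P`, the block length and the value of `delta` are hypothetical choices made only for illustration.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

def is_typical(xs, P, delta):
    """Check |N(x|x^n) - n P(x)| <= delta * sqrt(n) * sigma_x for every letter x."""
    n = len(xs)
    counts = Counter(xs)
    return all(
        abs(counts.get(x, 0) - n * p) <= delta * sqrt(n) * sqrt(p * (1 - p))
        for x, p in P.items()
    )

def typical_set_probability(P, n, delta):
    """Probability of the typical set under the i.i.d. distribution P^n (brute force)."""
    total = 0.0
    for xs in product(P, repeat=n):
        if is_typical(xs, P, delta):
            total += prod(P[x] for x in xs)
    return total

# Hypothetical example: a binary source and a short block length.
P = {"a": 0.7, "b": 0.3}
n, delta = 10, 2.0
prob = typical_set_probability(P, n, delta)
chebyshev_bound = 1 - len(P) / delta**2
```

The enumeration is exponential in `n`, so this is only meant as a sanity check on toy sizes, not as a practical way to handle typical sets.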

III. ... AND HOW TO ACHIEVE IT (CHEATING SLIGHTLY) 

The following construction is a generalization and refinement
of the one by Bennett et al. [?] (footnote 4), found
independently by Dür, Vidal, and Cirac [?]. The idea there
is to use common randomness between the sender and the
receiver of the encoded messages. Formally this means that
$E$ and $D$ also depend on a common random variable $\nu$, uniformly
distributed and independent of all others. Note that
this has a nice expression when viewing $E$ and $D$ as map-valued
random variables: here we allow dependence (via $\nu$)
between $E$ and $D$, while in the initial definition, eq. (3),
$E$ and $D$ are independent (as random variables). It seems
that the power of allowing the use of common randomness
can be understood from this point of view: it is a "convexification"
of the theory with deterministic or independent
encoders and decoders.

It is easy to see that the lower bound of theorem 2 still
applies here. We only have to modify the derivation a little
bit:

$$\begin{aligned}
\log|\mathcal{C}| &\geq H(E(X^n)|\nu) \\
 &\geq I(X^n \wedge E(X^n)|\nu) \\
 &\geq I(X^n \wedge D(E(X^n))|\nu) \\
 &\geq I(X^n \wedge Y^n|\nu) - n\tilde{f}(\lambda) \\
 &= I(X^n \wedge Y^n) - n\tilde{f}(\lambda),
\end{aligned}$$

with a slight variant $\tilde{f}$ of $f$.

We shall apply an explicit large deviation estimate for
sampling probability distributions from [?] (extended to
density operators in [?]), which we state separately without
proof:

Lemma 5: Let $X_1,\ldots,X_M$ be independent identically
distributed (i.i.d.) random variables with values in the
function algebra on the finite set $\mathcal{K}$, which are bounded between
$0$ and $\mathbf{1}$, the constant function with value 1. Assume
that the average $\mathbb{E}X_\mu = \sigma \geq s\mathbf{1}$. Then for $0 < \eta < 1/2$

$$\Pr\Bigl\{\frac{1}{M}\sum_{\mu=1}^{M} X_\mu \notin [(1\pm\eta)\sigma]\Bigr\} \leq 2|\mathcal{K}| \exp\Bigl(-M\frac{\eta^2 s}{2\ln 2}\Bigr),$$

where $[(1\pm\eta)\sigma] = [(1-\eta)\sigma; (1+\eta)\sigma]$ is an interval in the
value-wise order of functions: $[A;B] = \{X : \forall k \ A(k) \leq X(k) \leq B(k)\}$. ■

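Before we prove our main theorem, we need three lemmas
on exact types and conditional types. The first is a simple
yet crucial observation:

Lemma 6: Let $W$ be a channel from $\mathcal{X}$ to $\mathcal{Y}$, $P$ a p.d.
on $\mathcal{X}$, $Q = PW$ the induced distribution on $\mathcal{Y}$ and $V$ the
transpose channel from $\mathcal{Y}$ to $\mathcal{X}$.

Let $R$, $S$ be exact $n$-types of $\mathcal{X}$, $\mathcal{Y}$, respectively, that are
marginals of a joint exact $n$-type $T$ of $\mathcal{X} \times \mathcal{Y}$. Consider
the uniform distribution $P_R$ on $T^n_R$, which has the
property

$$P_R(x^n) = \frac{1}{|T^n_R|} = \frac{P^n(x^n)}{P^n(T^n_R)} \quad (\text{for } x^n \in T^n_R),$$

and the channel $W_T$ from $T^n_R$ to $T^n_S$,

$$W_T(y^n|x^n) = \frac{|T^n_R|}{|T^n_T|} = \frac{W^n(y^n|x^n)}{W^n(T^n_T(x^n)|x^n)} \quad (\text{for } x^n y^n \in T^n_T),$$

where $T^n_T(x^n) := T^n_T \cap (\{x^n\} \times T^n_S)$ is the set of conditional
exact typical sequences of $x^n$.

Then the induced distribution $Q_S = P_R W_T$ on $T^n_S$ is the
uniform distribution, i.e.

$$Q_S(y^n) = \frac{1}{|T^n_S|} = \frac{Q^n(y^n)}{Q^n(T^n_S)} \quad (\text{for } y^n \in T^n_S),$$

and the transpose channel to $W_T$ is indeed $V_T$, defined by

$$V_T(x^n|y^n) = \frac{|T^n_S|}{|T^n_T|} = \frac{V^n(x^n|y^n)}{V^n(T^n_T(y^n)|y^n)} \quad (\text{for } x^n y^n \in T^n_T),$$

with $T^n_T(y^n) := T^n_T \cap (T^n_R \times \{y^n\})$.

Proof. Straightforward. ■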

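Lemma 7: There is an absolute constant $K$ such that for
all distributions $P$ on $\mathcal{X}$, $x^n \in T^n_R$, channels $W : \mathcal{X} \to \mathcal{Y}$
and $\delta > 0$

$$\begin{aligned}
|T^n_{P,\delta}| &\leq \exp\bigl(nH(P) + K\delta|\mathcal{X}|\sqrt{n}\bigr), \\
|T^n_{P,\delta}| &\geq \exp\bigl(nH(P) - K\delta|\mathcal{X}|\sqrt{n}\bigr), \\
|T^n_{W,\delta}(x^n)| &\leq \exp\bigl(nH(W|R) + K\delta|\mathcal{X}\times\mathcal{Y}|\sqrt{n}\bigr), \\
|T^n_{W,\delta}(x^n)| &\geq \exp\bigl(nH(W|R) - K\delta|\mathcal{X}\times\mathcal{Y}|\sqrt{n}\bigr).
\end{aligned}$$

For $\delta = 0$, consider a joint $n$-type $T$ on $\mathcal{X} \times \mathcal{Y}$ with
marginals $R$ on $\mathcal{X}$ and $S$ on $\mathcal{Y}$. Then, introducing the
channel $Z$ with $T(xy) = R(x)Z(y|x)$:

$$\begin{aligned}
|T^n_R| &\leq \exp(nH(R)), \\
|T^n_R| &\geq (n+1)^{-|\mathcal{X}|} \exp(nH(R)), \\
|T^n_T(x^n)| &\leq \exp(nH(Z|R)), \\
|T^n_T(x^n)| &\geq (n+1)^{-|\mathcal{X}||\mathcal{Y}|} \exp(nH(Z|R)).
\end{aligned}$$

Proof. See [?]. ■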
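The exact-type estimates can be verified numerically, since the size of a type class is a multinomial coefficient. The following Python sketch computes $|T^n_R|$ for a given vector of letter counts and compares it with the bounds $(n+1)^{-|\mathcal{X}|}\exp(nH(R))$ and $\exp(nH(R))$. It assumes that $\exp$ and $\log$ are taken to base 2 (any common base works, as long as it is used consistently); the counts `(6, 3, 1)` are a hypothetical example.

```python
from math import factorial, log2

def type_class_size(counts):
    """Number of sequences with the given letter counts (multinomial coefficient)."""
    size = factorial(sum(counts))
    for c in counts:
        size //= factorial(c)
    return size

def entropy_of_type(counts):
    """Entropy (in bits) of the empirical distribution R(x) = counts[x] / n."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def type_class_bounds(counts):
    """Lower and upper bounds on |T_R^n| for the exact type R given by counts."""
    n = sum(counts)
    upper = 2 ** (n * entropy_of_type(counts))
    lower = upper / (n + 1) ** len(counts)
    return lower, upper

# Hypothetical example: n = 10 over a ternary alphabet with counts (6, 3, 1).
counts = (6, 3, 1)
size = type_class_size(counts)
lower, upper = type_class_bounds(counts)
```

For these counts the type class has 840 elements, which indeed lies between the two bounds; the polynomial prefactor is what makes the transmission of the type itself asymptotically free.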

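The third contains the central insight for our construction:

Lemma 8: With the hypotheses and notation of lemma 6
there exist families $(y^n_{\mu\nu})_{\mu=1,\ldots,M}$, $\nu = 1,\ldots,N$, from $T^n_S$
such that for all $\nu$

$$\frac{1}{M}\sum_{\mu=1}^{M} V_T(\cdot|y^n_{\mu\nu}) \in [(1-\epsilon)P_R, (1+\epsilon)P_R], \tag{I$_\nu$}$$

and

$$\frac{1}{NM}\sum_{\nu=1}^{N}\sum_{\mu=1}^{M} \delta_{y^n_{\mu\nu}} \in [(1-\epsilon)Q_S, (1+\epsilon)Q_S], \tag{II}$$

for all $M$ and $N$ that satisfy

$$M > \frac{2\ln 2}{\epsilon^2} \frac{|T^n_R||T^n_S|}{|T^n_T|} \log(4N|T^n_R|),$$

$$NM > \frac{2\ln 2}{\epsilon^2} |T^n_S| \log(4|T^n_S|).$$

Proof. Introduce i.i.d. random variables $Y^n_{\mu\nu}$, distributed on $T^n_S$
according to $Q_S$ (i.e. uniformly). Then for all $\mu, \nu$:

$$\mathbb{E}\,\delta_{Y^n_{\mu\nu}} = Q_S, \qquad \mathbb{E}\,V_T(\cdot|Y^n_{\mu\nu}) = P_R.$$

Hence lemma 5 applies (after rescaling $V_T$ by its maximal value $|T^n_S|/|T^n_T|$,
so that the variables are bounded by 1) and we find

$$\Pr\{\neg \mathrm{I}_\nu\} \leq 2|T^n_R| \exp\Bigl(-M \frac{\epsilon^2 |T^n_T|}{2\ln 2\, |T^n_R||T^n_S|}\Bigr)$$

and

$$\Pr\{\neg \mathrm{II}\} \leq 2|T^n_S| \exp\Bigl(-NM \frac{\epsilon^2}{2\ln 2\, |T^n_S|}\Bigr).$$

By choosing $N$ and $M$ according to the lemma we enforce
that the sum of these probabilities is less than 1, hence
there are actual values of the $Y^n_{\mu\nu}$ such that all (I$_\nu$) and
(II) are satisfied. ■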

With this we are ready to prove: 

Theorem 9: There exists an $(n,\lambda)$-code $(E_\nu, D_\nu)_{\nu=1,\ldots,N}$
with

$$|\mathcal{C}| \leq \exp\bigl(nI(P;W) + O(\sqrt{n})\bigr)$$

and common randomness consumption

$$N \leq \exp\bigl(nH(W|P) + O(\sqrt{n})\bigr).$$

In fact, not only the condition (4) is satisfied but the even
stronger



(11) 

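Proof. Suppose $x^n$ is seen at the source, and that its type is
$R$. For each joint $n$-type $T$ of $\mathcal{X} \times \mathcal{Y}$ we assume that families
$(y^n_{\mu\nu})$ as described in lemma 8 are fixed throughout.
Then the protocol the sender follows is:

1. Choose a joint type $T$ on $\mathcal{X} \times \mathcal{Y}$ with probability
$W^n(T^n_T(x^n)|x^n)$ and send it. Note that $T$ can be written
$T(xy) = R(x)Z(y|x)$, with the marginal $R$ on $\mathcal{X}$ and a
channel $Z : \mathcal{X} \to \mathcal{Y}$.

2. If $R$ is not typical or $T$ is not jointly typical then terminate.

3. Use the common randomness to choose $\nu$ uniformly.

4. Choose $\mu$ according to

$$\Pr\{\mu | x^n, \nu\} = \frac{V_T(x^n|y^n_{\mu\nu})}{\sum_{\mu'=1}^{M} V_T(x^n|y^n_{\mu'\nu})}$$

and send it.

The receiver chooses $y^n = y^n_{\mu\nu}$, using the common randomness
sample $\nu$. Let us first check that this procedure
works correctly:

For typical $x^n$ we can calculate the distribution of $y^n$ conditional
on the event that their joint type is $T$: this is then
a distribution on $T^n_T(x^n)$, and we assume $T$ to be typical.

$$\begin{aligned}
\Pr\{y^n | x^n, T\} &= \frac{1}{N}\sum_{\nu=1}^{N}\sum_{\mu=1}^{M} \delta_{y^n_{\mu\nu}}(y^n) \frac{V_T(x^n|y^n_{\mu\nu})}{\sum_{\mu'} V_T(x^n|y^n_{\mu'\nu})} \\
 &= \frac{1}{1+B(\epsilon)} \, \frac{1}{NM}\sum_{\nu,\mu} \delta_{y^n_{\mu\nu}}(y^n) \frac{V_T(x^n|y^n)}{P_R(x^n)} \\
 &= \frac{1+B(\epsilon)}{1+B(\epsilon)} \, \frac{Q_S(y^n) V_T(x^n|y^n)}{P_R(x^n)} \\
 &= \frac{1+B(\epsilon)}{1+B(\epsilon)} \, \frac{W^n(y^n|x^n)}{W^n(T^n_T(x^n)|x^n)},
\end{aligned}$$

with the "big-B" notation: $B(\epsilon)$ signifies any function
whose modulus is bounded by $\epsilon$. Here we have used the
definition of the protocol, lemma 6 (for the definition
of $V_T$ and the fact that $W^n(y^n|x^n)$ does not depend on
$y^n \in T^n_T(x^n)$), and lemma 8. So, the induced distribution
is, up to a factor between $(1-\epsilon)/(1+\epsilon)$ and $(1+\epsilon)/(1-\epsilon)$, equal to the
correct output distribution $W_T(\cdot|x^n)$. Now averaging over the
typical $T$ gives eq. (11).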

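What is the communication cost? Sending $T$ is asymptotically
for free, as the number of joint types is bounded
by the polynomial $(n+1)^{|\mathcal{X}||\mathcal{Y}|}$. Sending $\mu$ costs $\log M$
bits, with $M$ bounded according to lemma 8. That is,

$$\begin{aligned}
\log M &\leq n \max_{T \text{ typical}} I(R;Z) + O(\log n) \\
 &\leq nI(P;W) + O(\sqrt{n}).
\end{aligned}$$

On the other hand

$$\begin{aligned}
\log N &\leq n \max_{T \text{ typical}} \bigl(H(RZ) - I(R;Z)\bigr) \\
 &\leq nH(W|P) + O(\sqrt{n}),
\end{aligned}$$

and we are done. ■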

Remark 10: In the above statement of theorem 9 we assumed
$\lambda$ to be a constant, absorbed into the "$O(\sqrt{n})$" in
the code length estimate. Using the Chernoff estimate (10)
on the probabilities of typical sets in the above proof in fact
shows the existence of an $(n,\lambda)$-code satisfying (11) with

$$|\mathcal{C}| \leq \exp\bigl(nI(P;W) + O(-\log\lambda)\sqrt{n}\bigr).$$

In the line of [?], the interpretation of this result is that
investing common randomness at rate $H(W|P)$, one can
simulate the noisy channel by a noiseless one of rate
$I(P;W)$, when sending only $P$-typical words.
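To get a feeling for the two rates involved, the following Python sketch computes the communication rate $I(P;W)$ and the common randomness rate $H(W|P)$ for a given source and channel, and checks that they add up to the output entropy $H(PW)$. The binary symmetric channel with crossover probability 0.1 and the uniform input are hypothetical example values; entropies are measured in bits.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(v * log2(v) for v in p if v > 0)

def simulation_rates(P, W):
    """Return (I(P;W), H(W|P), H(PW)) for input distribution P and channel rows W[x]."""
    n_out = len(W[0])
    out = [sum(P[x] * W[x][y] for x in range(len(P))) for y in range(n_out)]
    h_out = entropy(out)                                       # H(PW)
    h_cond = sum(P[x] * entropy(W[x]) for x in range(len(P)))  # H(W|P)
    return h_out - h_cond, h_cond, h_out

# Hypothetical example: binary symmetric channel, crossover 0.1, uniform input.
P = [0.5, 0.5]
W = [[0.9, 0.1], [0.1, 0.9]]
comm_rate, randomness_rate, output_entropy = simulation_rates(P, W)
```

For this channel roughly 0.53 bits per letter are communicated and roughly 0.47 bits per letter of common randomness are consumed, together accounting for the full output entropy of 1 bit.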

Considering the construction again, we observe that in
fact it provides not only a simulation of the channel $W$,
but additionally of the noiseless passive feedback, simply
because the sender can read off from his random choices
the $y^n$ obtained by the receiver, too. This observation is
the key to show that our above construction is optimal
under the hypothesis that the channel with noiseless passive
feedback is simulated: in fact, since both sender and
receiver can observe the very output sequence $y^n$ of the
channel, which has entropy $H(PW)$, they are able to generate
common randomness at this rate. Since communication
was only at rate $I(P;W)$, the difference must be
invested in prepared common randomness: otherwise we
would get more of it out of the system than we could have
possibly invested. Formally this insight is captured by the
following result:

Theorem 11: If the decoder of a (n, A)-code {E, D) with 
common randomness consumption v G [N] (with distribu- 
tion ^) depends deterministically on v and c G C (which 
is precisely the condition that the encoder can recover the 
receiver's output) then 

$$|\mathcal{C}| \geq \exp\bigl(nI(P;W) - O(\sqrt{n})\bigr),$$

$$N|\mathcal{C}| \geq \exp\bigl(nH(PW) - O(\sqrt{n})\bigr).$$

Proof For the first inequality introduce the channels A^'^n = 
D^{E^{x'^)), and their induced distributions R^^'> on 3^" 
and transpose channels with respect to P", i.e. 

P"(x")4^J(y") = p('')(y«)P^':^(a;"). 
Then we can rewrite eq. 



as 



< A. 
to y" € Tq g and replace VJJ, and B^^ by their restrictions 

we have 
A', 



This inequality oviously remains valid if we restrict the sum which, by the same standard trick |2j] as before, yields our 

estimate: with 5 ~ \f^^^ the set S' — S C] T^pw s satisfies 

f-A 



to T^siy""): Vyl and B^^', respectively. 



On the other hand, choosing b 



^\x\\y\ 

f + A 



which yield 

Hence there exists at least one v such that 

> 1 - A'. 

Note that, as functions on A"", 

so, when we introduce the support S of the left hand side, 
we arrive at 

P"(5) > 1 - A', 



from which our claim follows by a standard trick p4[ : 
5' = 5 n Tj?^, , with (5' = Then 

and using the fact that 

Vx" e r^;^, P"(2;") < exp(-niJ(P) + A'| ^"1(5'^^), 
this implies 

1-A' 



let 



151 > \S'\ > 



■ exp{nH{P) - K\X\S'y/n). 



Now only note that (since is deterministic) 
\S\ < \C\ max |r^?,(y")| < |C|exp(ni/(l^|Q) +0(y^)), 

and by /(P; M^) = /(Q; V^) = H{P)-H{V\Q) we are done. 

Now for the second inequality: from the definition we 
get, by summing over x". 



1/ a;" 



< A. 



Because the $D_\nu$ are all deterministic, the distributions
$D_\nu(E_\nu(x^n))$ are all supported on sets of cardinality $|\mathcal{C}|$.
Hence the support $S$ of $\sum_\nu \sum_{x^n} P^n(x^n) D_\nu(E_\nu(x^n))$ can be
estimated $|S| \leq N|\mathcal{C}|$.

On the other hand, we deduce 

{PW)"{S) > 1- A, 



(PVF)"(5') > 
but since for all y" £ T^pws 

(PM^)"(2/") < e-x^{-nH{PW) + K\y\6^), 
we can conlude 

N\C\ > \S\ >\S'\> exp{-nH{PW) + K\y\S^). 



Collecting these results we can state

Corollary 12: For any simulation of the channel $W$ together
with its noiseless passive feedback with error $\lambda < 1$,
at rate $R$ and common randomness consumption rate $C$:

$$R \geq C(W) = \max_P I(P;W), \qquad R + C \geq \max_P H(PW).$$

Conversely, these rates are also achievable.

Proof. A simulation of the channel must be in the error
bound for every input $x^n$, hence eq. (4) will be satisfied
for every distribution $P$. The lower bounds follow now
from theorem 11 by choosing $P$ to maximize $I(P;W)$ and
$H(PW)$, respectively.

To achieve this, the encoder, on seeing $x^n$, reports its type
to the receiver (asymptotically free) and then they use the
protocol of theorem 9 for $P = P_{x^n}$, the empirical distribution
of $x^n$. Possibly they have to use the channel at rate
$C(W) - I(P;W)$ to set up additional common randomness
beyond the given $\max_P H(PW) - C(W)$. ■

At this point we would like to point out a remarkable 
parallel of methods and results to the work : our use of 
lemma || is the classical case of of the use of its quantum 
version from [Q, and the main result of the cited paper is 
the quantum analog of the present theorem ^ The opti- 
mality result there has its classical case formulated in the- 
orems H (and H) and and even the construction of the 
following section has its counterpart there. 

The use of common randomness turned out to be re- 
markably powerful, and it is known in various occasions to 
make problems more tractable: a major example is the ar- 
bitrarily varying channel (see for example the review jl^). 
While for discrete memoryless channels it does not lead 
to improved rates or error bounds, it there allows for a 
"reverse" of Shannon's coding theorem |^ in the sense of 
simulating efficiently a noisy channel by a noiseless one. 
This viewpoint seems to extend to quantum channels as 
well, assisted by entanglement rather than common ran- 
domness: see B. We shall expand on the power of the 



"randomness assisted" viewpoint in section VI 
IV. Solution under a letterwise criterion 

Here we show that from the theorem of the previous section
a solution to the compression problem under a slightly
relaxed distance criterion follows: whereas previously we
had to employ common randomness to achieve the lower
bound $I(P;W)$, this will turn out to be unnecessary now.
Specifically, our condition will be eq. (5):

Theorem 13: There exists an $n$-block code $(E,D)$ with

$$|\mathcal{C}| \leq \exp\bigl(nI(P;W) + O(\sqrt{n})\bigr),$$

such that

n 



2' 

fe=i 



A. 



Proof. Choose an $(n,\lambda)$-code $(E_\nu, D_\nu)_{\nu=1,\ldots,N}$ as in theorem 9.
Obviously this code meets the condition of the
theorem, except for the use of common randomness. We
will show that a uniformly random choice among a small
(subexponential) number of $\nu$ is sufficient for this to hold.
Then the protocol simply is:

1. The sender chooses $\nu$ uniformly at random (among the chosen
few), and sends it to the receiver (at asymptotic rate 0).

2. She uses $E_\nu$ to encode, and the receiver uses $D_\nu$ to decode.

By construction this meets the requirements of the theorem.

To prove our claim, note that from theorem |^ we can 
infer 



< e. 



Introduce i.i.d. random variables Ti,... ,Tq, distributed 
according to With the notations xj;!^ = D^{E^{x")) 
and X^^jj. = D^{E^{x'^))k we have 



EX 



a;" J 



Denote the minimal nonzero entry of W by u, and choose 
e so small that for all typical and all k 

X^,^ > - onsuppPF,,. 

By lemma |] we obtain 

<2|:y|exp (-Q 



41n2 



Hence the sum of these probabilities is upper bounded by 
213^1 l-^-l exp (-Q^ 



which is less than 1 for 
4 In 2 



Q > 



(nlog|A'|+log(2|3;|)) 



Hence there exist actual values Ti , . . . , Tq such that 



< 3e, 



which is what we wanted to prove: observe that Q grows 
only polynomially. I 

As we remarked already in the introduction, [13] proposed
to prove this result (and indeed more, being interested
in the tradeoff between rate and error), but eventually
turned to the much softer condition (7), which originates
from the traditional model of rate distortion theory.

V. A general construction 

Nice though the idea of the previous section is, the lower 
bound results show that on this road we cannot hope to 
approach the conjectured bound, because without common 
randomness at hand we have to spend communication at 
the same rate to establish it (compare [ p5[ , appendix, for 
this rather obvious-looking fact). 

In this section we want to study the perfect restitution
of the probability distributions $W_x$ (i.e. $\lambda = 0$):

Recall that here we want to minimize $H(E(X^n))$, and
this minimum we call $S(n)$. Obviously $S(n_1+n_2) \leq S(n_1) + S(n_2)$,
so the limit

$$T(P,W) = \lim_{n\to\infty} \frac{1}{n} S(n)$$

exists, and is equal to the infimum of the sequence.
Then we have

Theorem 14: For all $\lambda \in (0,1)$

$$\limsup_{n\to\infty} \frac{1}{n} \log M(n,\lambda) \leq T(P,W).$$

Proof. It is sufficient to prove the inequality for $S(1)$ in
place of $T(P,W)$:

Fix a 1-code $(e,d)$ with $H(e(X)) = S(1)$. Then, for $n > 1$
choose any $(n,\lambda)$-source code $(F,G)$ for $e(X_1)\ldots e(X_n)$,
which is possible at rate $H(e(X)) + o(1)$. Then $(E,D)$ with
$E = F \circ e^n$ and $D = d^n \circ G$ is an $(n,\lambda)$-code for the mixed
state source with limiting rate $S(1)$. ■

It would be nice if we could prove also an inequality in the 
other direction, but it seems that a direct reduction like in 
the previous proof does not exist: for this we would need 
to take an (n, A)-code and convert it to an (n, 0)-code, 
increasing the entropy only slightly. 

A nice picture to think about the problem of finding $S(1)$
is the following, in the spirit of flow networks:

From the source we go to one of the nodes $x \in \mathcal{X}$,
with probability $P(x)$. Then, with a probability of
$E_{xc} = E(x)(c)$ we go to $c \in \mathcal{C}$, and from there with a probability
of $D_{cy} = D(c)(y)$ to $y \in \mathcal{Y}$. Then the condition is that

$$\forall x, y: \quad \sum_{c \in \mathcal{C}} E_{xc} D_{cy} = W_x(y).$$
Fig. 1. The probability flow network to simulate the distributions
$W_x$. Note that we included a sink, edges leading to the sink
obviously having probability 1.



Examples of this construction are discussed in [?] (where
it was in fact invented), and here we want to add some
general remarks on optimizing it, as well as thoughts on a
possible algorithm to do that.

We begin with a general observation on the number of 
intermediate nodes: 

Theorem 15 ("c dice with d sides"): An optimal zero error
code for $W$ requires at most $CD - 1$ intermediate nodes,
with $C = |\mathcal{X}|$, $D = |\mathcal{Y}|$.

Proof. For a fixed set $\mathcal{C}$ the problem is the following:
Under the constraints

$$\forall x, c: \ E_{xc} \geq 0, \qquad \forall x: \ \sum_c E_{xc} = 1, \tag{12}$$

$$\forall c, y: \ D_{cy} \geq 0, \qquad \forall c: \ \sum_y D_{cy} = 1, \tag{13}$$

$$\forall x, y: \ \sum_c E_{xc} D_{cy} = W_x(y), \tag{14}$$

minimize the entropy $H(\mu)$, where $\mu_c = \sum_x P_x E_{xc}$.

Observe that for each fixed set of $D_{cy}$ the constraints
define a convex admissible region for the $E_{xc}$, of which a
concave function is to be minimized. Hence, the minimum
will be achieved at an extreme point of the region, that we
rewrite as follows:

$$\Bigl\{E_{xc} \geq 0 : \forall x, y \ \sum_c E_{xc} D_{cy} = W_x(y)\Bigr\} = \bigoplus_x B_x, \qquad B_x = \Bigl\{E_{xc} \geq 0 : \forall y \ \sum_c E_{xc} D_{cy} = W_x(y)\Bigr\}.$$

An extreme point must be extremal in each of the summand
convex bodies $B_x$. On the other hand, an extreme
point of $B_x$ must meet $\dim B_x$ many of the inequalities
($E_{xc} \geq 0$) with equality. Since $\dim B_x \geq |\mathcal{C}| - D$ there
remain only at most $D$ nonzero $E_{xc}$ for every $x$. In particular,
only at most $CD$ many $c \in \mathcal{C}$ are accessed at all. In
fact, to minimize $H(\mu)$, at most $CD - 1$, otherwise $c$ would
contain full information about $x$. ■

Remark 16: The last argument can be improved: for 
C,D > 2 we can even assume \C\ < CD — C + 1. 

The argument of the proof gives us the idea that maybe 
by an alternating minimization we can find the optimal 
code: 

Indeed, conditions (12) and (14) for fixed $D$ are linear in
$E$, and the target function is concave (entropy of a linear
function of $E$), so we can find its minimum at an extreme
point of the admissible region. This part is solved by standard
convex optimization methods. On the other hand,
for fixed $E$, eqs. (13) and (14) are linear in $D$. However,
varying $D$ does not change the aim function. Still we have
freedom to choose, and this might be a good rule: let $D$
maximize the conditional entropy $H(D|\mu)$. The rationale
is that this entropy signifies the ignorance of the sender
about the actual output. If it does not approach $H(W|P)$
in the limit this means that the protocol simulates partial
feedback of the channel $W$, which could be used to extract
common randomness. This amount is a lower bound to
what the protocol has to communicate in excess of $I(P;W)$.
We have, however, no proof that this rule converges to an
optimum.
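The constraints and the target function of this optimization are easy to evaluate for a candidate code. The Python sketch below is not the alternating minimization itself; it only checks the zero-error condition $\sum_c E_{xc}D_{cy} = W_x(y)$ and computes $H(\mu)$ for two obvious feasible codes: sending $x$ itself (so $D = W$), and letting the encoder sample $y$ from $W_x$ and send it (so $E = W$). These two codes show $S(1) \leq \min\{H(P), H(PW)\}$. The source `P` and channel `W` are hypothetical example values.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(v * log2(v) for v in p if v > 0)

def is_zero_error(W, E, D, tol=1e-12):
    """Check sum_c E[x][c] * D[c][y] == W[x][y] for all x, y."""
    for x in range(len(W)):
        for y in range(len(W[0])):
            simulated = sum(E[x][c] * D[c][y] for c in range(len(D)))
            if abs(simulated - W[x][y]) > tol:
                return False
    return True

def code_entropy(P, E):
    """Entropy H(mu) of the codeword distribution mu_c = sum_x P[x] * E[x][c]."""
    mu = [sum(P[x] * E[x][c] for x in range(len(P))) for c in range(len(E[0]))]
    return entropy(mu)

# Hypothetical source and channel.
P = [0.5, 0.5]
W = [[0.9, 0.1], [0.2, 0.8]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# Code A: the codeword is x itself; the decoder applies W.
E_a, D_a = identity, W
# Code B: the encoder samples y from W_x and sends it; the decoder is the identity.
E_b, D_b = W, identity

h_a = code_entropy(P, E_a)  # equals H(P)
h_b = code_entropy(P, E_b)  # equals H(PW)
```

In this example code B is slightly cheaper than code A because the output distribution $PW = (0.55, 0.45)$ has lower entropy than the uniform input; a proper optimization over $E$ and $D$ could in principle do better than both.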

VI. Applications 

In this section we point out three important connections
to other questions, some of which depend on positive answers
to the questions 4 and 4'.

A. Common randomness 

It is known that if two parties (say, Alice and Bob) have
access to many independent copies of the pair of random
variables (X, Y) (which are supposed to be correlated),
then they can, by public discussion (which is overheard
by an eavesdropper), create common randomness at rate
I(X ∧ Y), almost independent of the eavesdropper's information.
For details see where this is proved, and also
the optimality of the rate. One might turn around the question
and ask how much common randomness is required to
create the pair (X^n, Y^n) approximately. This question, in
the vein of that of the previous subsection, is really about
reversibility of transformations between different appearances
of correlation. Note that this was confirmed in |Q
for the case of deterministic correlation between X and Y,
i.e. H(Y|X) = H(X|Y) = 0, which there was paralleled to
entanglement concentration and dilution for pure states.
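
To make the rate I(X ∧ Y) concrete, the following minimal sketch computes the mutual information of a finite joint distribution. The function name and the two example distributions (a fair bit observed through a binary symmetric channel with flip probability 0.1, and a perfectly correlated pair) are our own illustrative assumptions; logarithms are taken to base 2.

```python
import math

def mutual_information(joint):
    """Return I(X ∧ Y) in bits for joint[x][y] = Pr{X = x, Y = y}."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    info = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:                         # 0 log 0 = 0 by convention
                info += pxy * math.log2(pxy / (px[x] * py[y]))
    return info

# Hypothetical example: a fair bit X and a copy Y flipped with probability 0.1.
flip = 0.1
noisy_pair = [[0.5 * (1 - flip), 0.5 * flip],
              [0.5 * flip, 0.5 * (1 - flip)]]

# Deterministic correlation, H(Y|X) = H(X|Y) = 0: here I(X ∧ Y) = H(X).
perfect_pair = [[0.5, 0.0],
                [0.0, 0.5]]

print(mutual_information(noisy_pair))
print(mutual_information(perfect_pair))
```

In the deterministic case the mutual information coincides with the entropy of either variable, which is the situation singled out at the end of the preceding paragraph.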

An affirmative answer to question ^, surprisingly implies 
that a rate of I(X ∧ Y) of common randomness is sufficient,
with no further public discussion, to create pairs X, Y. This
is done by first creating the distribution Q of E(X^n) on C
from the common randomness (Alice and Bob each do this
on their own!). This may not be altogether obvious, as the
common randomness is assumed in pure form (i.e. a uniform
distribution on N alternatives), while the distribution
Q may have no regularity. To overcome this difficulty fix
an ε > 0 and let

\[ k = \left\lceil \frac{\log|C| - \log\epsilon}{\log(1+\epsilon)} \right\rceil . \]

Now we partition the unit interval into the subintervals

\[ I_a = \left[ (1+\epsilon)^{-a},\ (1+\epsilon)^{-(a-1)} \right), \quad a = 1, \ldots, k, \qquad I_\infty = \left[ 0,\ (1+\epsilon)^{-k} \right), \]

and define C_a = {c ∈ C : Q(c) ∈ I_a}, q_a = Q(C_a). Notice
that for a < ∞ the probabilities for c's belonging to the
same set C_a differ from each other only by a factor between
1 − ε and 1 + ε, and that q_∞ ≤ ε, because of (1+ε)^{-k} ≤ ε/|C|,
by definition of k. Hence, defining uniform distributions U_a
on C_a for a < ∞, it is immediate that

\[ \Bigl\| Q - \sum_{a=1}^{k} q_a U_a \Bigr\|_1 \le 2\epsilon . \]



Now the distribution on the a = 1, ..., k in this formula
can be approximated to within 1/k by a k^2-type distribution,
which in turn can be obtained directly from a uniform
distribution on k^2 alternatives. In this way we have reduced
everything to a number of uniform distributions, possibly on
differently sized sets, all bounded by |C|, and a helper uniform
distribution on a set of size k^2. However, it is well
known that these can be obtained from a uniform distribution
on O(k^2 |C|) items within arbitrarily small error.
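
The two approximation steps can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the buckets to be the geometric intervals with ratio 1 + ε and k = ⌈(log|C| − log ε)/log(1+ε)⌉, drops the remainder class of very small probabilities, and then rounds the bucket weights to a type whose denominator is the square of the number of non-empty buckets (a simplification of the k^2-type in the text). The function names, the rounding rule and the random test distribution are ours.

```python
import math
import random

def bucket_mixture(q, eps):
    """Replace q by the mixture sum_a q_a U_a of uniform distributions.

    Bucket a = 1, ..., k collects the points c with
    (1 + eps) ** (-a) <= q[c] < (1 + eps) ** (-(a - 1)); points of smaller
    probability form the remainder class and receive weight 0.
    Returns (mixture, weights) with weights[a] = q_a.
    """
    k = math.ceil((math.log(len(q)) - math.log(eps)) / math.log(1 + eps))
    buckets = {}
    for c, p in enumerate(q):
        if p > 0:
            a = max(1, math.ceil(-math.log(p) / math.log(1 + eps)))
            if a <= k:
                buckets.setdefault(a, []).append(c)
    mixture = [0.0] * len(q)
    weights = {}
    for a, members in buckets.items():
        weights[a] = sum(q[c] for c in members)
        for c in members:
            mixture[c] = weights[a] / len(members)
    return mixture, weights

def round_to_type(probs, n):
    """Round a distribution to an n-type (all probabilities multiples of 1/n)."""
    counts = [math.floor(p * n) for p in probs]
    missing = max(0, n - sum(counts))
    # Give the missing units to the entries with the largest rounding loss.
    by_loss = sorted(range(len(probs)),
                     key=lambda i: probs[i] * n - counts[i], reverse=True)
    for i in by_loss[:missing]:
        counts[i] += 1
    return counts

# Hypothetical test distribution on 200 points with widely varying weights.
rng = random.Random(0)
raw = [rng.random() ** 4 for _ in range(200)]
q = [w / sum(raw) for w in raw]
eps = 0.1

mixture, weights = bucket_mixture(q, eps)
l1_error = sum(abs(p - m) for p, m in zip(q, mixture))

# Normalise the bucket weights (the remainder class was dropped) and round.
labels = sorted(weights)
total = sum(weights.values())
bucket_probs = [weights[a] / total for a in labels]
counts = round_to_type(bucket_probs, len(labels) ** 2)
type_error = sum(abs(p - c / len(labels) ** 2)
                 for p, c in zip(bucket_probs, counts))
print(l1_error, type_error)
```

The first error is bounded by 2ε by the argument in the text; the second is below one over the number of buckets because every rounded entry differs from its target by less than one unit of the type.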

Given this distribution on C, Bob applies D, whereas
Alice applies the transpose channel E' to E. One readily
checks that this produces the joint distribution of X^n, Y^n,
up to arbitrarily small disturbance in the total variational
norm.

Note that this result would imply a new proof of the optimality
of the rate I(X ∧ Y) of common randomness
distillation from (X, Y): if a higher rate were possible then,
because we can simulate the latter pair of random variables
with this rate of common randomness, we would obtain a net
increase of common randomness after application of the
distillation, which clearly cannot be.

B. Channel coding 

It was already pointed out that this study has the paper
1^ as one motivation, with its idea to prove the optimality
of Shannon's coding theorem by showing that every
noisy channel W can be simulated by a binary noiseless
one operating at rate C(W). Shannon's theorem is understood
as saying that the noisy channel can simulate a
binary noiseless one of rate C(W). Both simulations are allowed
to perform with small error. Note that an affirmation
of question ^', implies that this can be done, without the 
common randomness consumption like in section III. As 
indicated, this provides a proof of the converse to Shan- 
non's coding theorem: 

The idea is that otherwise we could, given a rate of C(W)
noiseless bits, simulate the channel, which in turn could be
used to transmit at a rate R > C(W). The combination
of simulation and coding yields a coding method for transmitting
R bits over a channel providing C(W) noiseless bits,
which is absurd (in jql this reasoning is called "causality ar- 
gument"). Theorenip^ allows us to prove even more: 

Theorem 17 (Shannon [^) For the channel W with 
noiseless feedback (i.e. after each symbol x transmitted the
sender gets a copy of the symbol y read by the receiver,
and may react in her encoding) the capacity is given by
C(W). In fact, for the maximum size M_f(n, λ) of an
(n, λ)-feedback code

\[ M_f(n, \lambda) \le \exp\bigl( n C(W) + O(\sqrt{n}) \bigr). \]
Proof. Let an optimal (n, λ)-feedback code for the channel
W^n with noiseless feedback be given. We will construct an
(n^2, λ')-code with shared randomness, as follows:

Choose a simulation of the channel W on n-blocks sending
nC(W) + O(√n log n) bits, using shared randomness,
and with error bounded by ε = (1 − λ)/(2n) (this is possible
by the construction of theorem y; see remark 10). We
shall use n independent copies of the feedback code in parallel:
in each round n input symbols are prepared and sent
through the channel, yielding n respective feedback symbols.
Obviously, each round can be simulated with an error
in the output distribution bounded by ε, using our simulation
of the channel W (which, as we remarked earlier,
simulates even the feedback). Each of the parallel executions
of the feedback code thus accumulates an error of
at most nε = (1 − λ)/2, increasing the error probability of the
code to (1 + λ)/2. Hence on the block of all the n feedback
codes we can bound the error probability by

\[ \lambda' = 1 - \left( \frac{1-\lambda}{2} \right)^{n} . \]

But this is subexponentially (in N = n^2) close to 1, so a
standard argument applies: 

First, by considering average error probability we can 
get rid of the shared randomness: there exists one value 
of the shared random variable for which the average error 
probability is bounded by λ'. Then we can argue that
there is a subset U of the constructed code's message set 
which has maximal error probability bounded by A" = 
and 

\[ |U| \ge (1 - \lambda'')\,|M|^{n} . \]

What we achieved so far hence is this: a code of |U|
messages with error probability λ'' and using NC(W) +
o(N) noiseless bits. Clearly, we may assume the encoder
to be deterministic without losing in error probability. But
then at most (1 − λ'')^{-1} messages can be mapped to the
same codeword without violating the error condition.

Collecting everything we conclude 

\[ |M|^{n} \le (1-\lambda'')^{-1}\,|U| \le (1-\lambda'')^{-2} \exp\bigl( n^2 C(W) + o(n^2) \bigr) = \bigl[ \exp\bigl( n C(W) + o(n) \bigr) \bigr]^{n} , \]

implying the theorem. ■ 

Remark 18: The weak converse (i.e. the statement that
the rate for codes with error probability approaching 0 is
bounded by C(W)) is much easier to obtain, by simply
keeping track of the mutual information between the message
and the channel output through the course of operating
a feedback code, using some well-known information
identities, and finally estimating the code rate employing
Fano's inequality.

C. Rate-distortion theorem 

Let d : X × Y → R_{≥0} be any distortion measure, i.e.
a non-negative real function. This function is extended to
words X^n × Y^n by letting

\[ d^n(x^n, y^n) = \sum_{k=1}^{n} d(x_k, y_k). \]



Shannon's rate distortion theorem is about the following
problem: construct an n-block code (E, D) (which may be
chosen to be deterministic) such that for a given d ≥ 0

\[ d(E, D) := \sum_{x^n} P^n(x^n)\, d^n\bigl( x^n, D(E(x^n)) \bigr) \le nd, \]

i.e., the average distortion between source and output word
is bounded by nd.
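
As a small worked instance of the quantity d(E, D), the sketch below enumerates all source words of a short block and evaluates the average distortion of a deterministic code. The Hamming distortion, the fair binary source and the majority-vote code are hypothetical choices made only for illustration.

```python
from itertools import product

def hamming(x, y):
    """Single-letter Hamming distortion d(x, y)."""
    return 0 if x == y else 1

def block_distortion(xs, ys, d=hamming):
    """d^n(x^n, y^n): the sum of the single-letter distortions."""
    return sum(d(x, y) for x, y in zip(xs, ys))

def average_distortion(source, n, encode, decode, d=hamming):
    """d(E, D) = sum over x^n of P^n(x^n) * d^n(x^n, D(E(x^n)))."""
    total = 0.0
    for xs in product(range(len(source)), repeat=n):
        prob = 1.0
        for x in xs:
            prob *= source[x]           # memoryless source: product probability
        total += prob * block_distortion(xs, decode(encode(xs)), d)
    return total

# Hypothetical 3-block code for a fair bit: send the majority bit (rate 1/3),
# and decode it to three copies of that bit.
def majority_encode(xs):
    return 1 if sum(xs) >= 2 else 0

def majority_decode(bit):
    return (bit, bit, bit)

avg = average_distortion([0.5, 0.5], 3, majority_encode, majority_decode)
print(avg)  # to be compared with n*d for the target distortion level d
```

Brute-force enumeration is exponential in n, so this is only meant for checking small examples by hand.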

A pair (R, d) of non-negative real numbers is said to be
achievable if there exist n-block codes with code rate tending
to R and distortion rate asymptotically bounded by d.
Define the rate-distortion function R(d) as the minimum
R such that (R, d) is achievable.

Theorem 19 (Shannon [17]) The rate distortion function
is given by the following formula:

\[ R(d) = \min \bigl\{ I(P; W) : W \text{ channel s.t. } \mathbf{E}d(X, Y) \le d \bigr\} , \]

where Ed(X, Y) = Σ_{x,y} P(x) W_x(y) d(x, y) is the expected
(single-letter) distortion when using the channel W.
The proof of "≥" here is a simple exercise using convexity
of mutual information in the channel and standard entropy
inequalities. We can give a simple proof of the "≤"-part
of this result, using theorem 11:

Choose some channel W satisfying the distortion constraint.
Then mapping x^n to W^n_{x^n} obviously satisfies the
distortion constraint on the code in the sense that the expected
distortion between input and output, over source
and channel, is bounded by nd. Of course, sampling W^n_{x^n} at
the encoder and sending some y^n will not meet the bound
I(P; W). However, we can apply theorem ^ to approximately
simulate the joint distribution of x^n and y^n by using
some common randomness ν and a deterministic code
(E_ν, D_ν) sending nI(P; W) + O(√n) bits. Hence, invoking
linearity of the definition of d(E, D),



\[ \sum_{\nu} \Pr\{\nu\}\, d(E_\nu, D_\nu) \le nd + O(\epsilon), \]

so there must be one ν such that d(E_ν, D_ν) ≤ nd + O(ε),
which ends our proof.
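
The minimisation in the rate-distortion formula can be explored numerically. The sketch below uses the Blahut-Arimoto iteration, a standard numerical method that is not part of this paper, to find for a given Lagrange parameter a channel W together with its I(P; W) and expected distortion. The function name, the parameter `slope`, the iteration count and the binary Hamming example are our assumptions; rates are in bits.

```python
import math

def blahut_arimoto(p, dist, slope, iters=200):
    """One point on the rate-distortion curve via Blahut-Arimoto.

    p[x]       : source distribution P
    dist[x][y] : distortion measure d(x, y)
    slope      : Lagrange parameter s >= 0 (larger s favours low distortion)
    Returns (I(P; W) in bits, expected distortion) for the channel W found.
    """
    nx, ny = len(p), len(dist[0])
    q = [1.0 / ny] * ny                      # output marginal, start uniform
    for _ in range(iters):
        # Best channel for the current output marginal.
        w = []
        for x in range(nx):
            row = [q[y] * math.exp(-slope * dist[x][y]) for y in range(ny)]
            z = sum(row)
            w.append([v / z for v in row])
        # Output marginal induced by that channel.
        q = [sum(p[x] * w[x][y] for x in range(nx)) for y in range(ny)]
    rate = sum(p[x] * w[x][y] * math.log2(w[x][y] / q[y])
               for x in range(nx) for y in range(ny)
               if p[x] > 0 and w[x][y] > 0)
    distortion = sum(p[x] * w[x][y] * dist[x][y]
                     for x in range(nx) for y in range(ny))
    return rate, distortion

# Hypothetical example: fair binary source with Hamming distortion.
hamming = [[0, 1], [1, 0]]
rate, distortion = blahut_arimoto([0.5, 0.5], hamming, slope=2.0)
print(rate, distortion)
```

For the fair binary source the result can be compared with the known closed form R(d) = 1 − h(d), where h is the binary entropy function.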

At this point we would like to advertise our point of view
that theorem 13, and even more so theorem ^ is what
rate-distortion is actually about: the former theorem shows how
to simulate a given channel on all individual positions of a
transmission, and this is what we need in rate-distortion.
In fact, rate-distortion theory is unchanged when instead 
of the one convex condition ( "distortion bound" ) on the 
code we have several, effectively restricting the admissible 
approximate joint types of input and output to any pre- 
scribed convex set — in particular a single point. 

The strength of theorem 13 in comparison to such a development
of rate-distortion theory lies in the fact that
with its help we satisfy the convex conditions in every letter,
not just in the block average. And theorem ^ gives
the analogue of this even with the condition imposed on
the whole block, yielding results that are not obtainable
by simply applying rate-distortion tools (see e.g. p^).

VII. Compression of sources of quantum states 

The problem studied in this paper has a natural extension
to quantum information theory: now the source emits
(generally mixed) quantum states W_x on the Hilbert space
Y (x ∈ X), with probabilities P(x), and an (n, λ)-code is
a pair (E, D) of maps

\[ E : X^n \to \mathcal{S}(C), \qquad D : \mathcal{S}(C) \to \mathcal{S}(Y^{\otimes n}), \tag{15} \]

where S(C) is the set of states on the code Hilbert space C
and D is completely positive, trace preserving, and linear.
The condition to satisfy is



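\[ \sum_{x^n} P^n(x^n)\, \frac{1}{2} \bigl\| W^n_{x^n} - D(E(x^n)) \bigr\|_1 \le \lambda , \tag{16} \]

with the trace norm ‖·‖_1 on density operators. Define, like
before, M(n, λ) as the minimum dim C of an (n, λ)-code.
Sometimes, the stronger condition

\[ \forall x^n \in T^n_{P,\delta}: \quad \frac{1}{2} \bigl\| W^n_{x^n} - D(E(x^n)) \bigr\|_1 \le \lambda \tag{17} \]

will be applied.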

Notice that this contains our original problem as the special
case of a quasiclassical ensemble, when all the W_x commute
(which means they can be interpreted as probability
distributions on a set of common eigenstates).
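
The reduction to the classical case can be seen in a small qubit computation: for commuting (here diagonal) states the trace norm of the difference is the ℓ1 distance of the eigenvalue distributions, whereas for non-commuting states there is no such reading. The sketch is restricted to 2×2 Hermitian matrices so that the eigenvalues have a closed form; the helper names and the example states are our own assumptions.

```python
import math

def trace_norm_2x2(m):
    """Trace norm (sum of |eigenvalues|) of a 2x2 Hermitian matrix."""
    a = complex(m[0][0]).real
    d = complex(m[1][1]).real
    b = complex(m[0][1])
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + abs(b) ** 2)
    # The eigenvalues are mean - radius and mean + radius.
    return abs(mean - radius) + abs(mean + radius)

def difference(rho, sigma):
    """Entrywise difference rho - sigma of two 2x2 matrices."""
    return [[rho[i][j] - sigma[i][j] for j in range(2)] for i in range(2)]

# Commuting (diagonal) states: the quasiclassical case.
rho = [[0.9, 0.0], [0.0, 0.1]]
sigma = [[0.6, 0.0], [0.0, 0.4]]
quasiclassical = trace_norm_2x2(difference(rho, sigma))
l1_distance = abs(0.9 - 0.6) + abs(0.1 - 0.4)

# Non-commuting pure states |0><0| and |+><+|.
zero = [[1.0, 0.0], [0.0, 0.0]]
plus = [[0.5, 0.5], [0.5, 0.5]]
noncommuting = trace_norm_2x2(difference(zero, plus))

print(quasiclassical, l1_distance, noncommuting)
```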

This problem (with a number of variations, which we ex- 
plained in the introductory section | for the classical case) 
is studied in There (and previously in ||ll[]) it is shown 
that the lower bound theorem ^ holds in the quantum case, 
too, with understanding H as von Neumann entropy: 

Theorem 20: For all n, λ

\[ \frac{1}{n} \log M(n, \lambda) \ge I(P; W) - f(\lambda), \]

with a function f(λ) → 0 for λ → 0. ■
Let us improve this slightly by proving the strong version 
of this result: 

Theorem 21: For all λ ∈ (0, 1)

\[ \liminf_{n \to \infty} \frac{1}{n} \log M(n, \lambda) \ge I(P; W). \]
Proof. By much the same method as the proof of theorem ^;
the changes are that we need the more general code selection
result of |2^, thm. II.4, instead of the classical theorem [||,
which we state separately below. If (E, D) is
an optimal (n, λ)-code, define

\[ A = \Bigl\{ x^n : \frac{1}{2} \bigl\| W^n_{x^n} - D(E(x^n)) \bigr\|_1 \le \sqrt{\lambda} \Bigr\} . \]



Obviously P^n(A) ≥ 1 − √λ > 0, so we can apply lemma 22
and find an (n, ε)-transmission code U ⊂ A for W^n such
that

\[ |U| \ge \exp\bigl( n I(P; W) - O(\sqrt{n}) \bigr) . \]

This is an (n, λ')-code for the channel D ∘ E, with λ' =
√λ + ε < 1, if we choose ε small enough. Combining E
with the transmission encoder, and D with the transmission
decoder, we obtain an (n, λ')-transmission code for |U|
many messages over a noiseless system with Hilbert space
C of dimension M(n, λ).

To each message u ∈ U there belongs a decoding operator
A_u ≥ 0 on the coding space C, forming together a POVM:
Σ_u A_u = 1. Now to decode correctly with probability
1 − λ', for each u we must have

\[ \operatorname{Tr} A_u \ge 1 - \lambda' . \]

On the other hand, by Σ_u Tr A_u = M(n, λ), we conclude

\[ M(n, \lambda) = \dim C \ge (1 - \lambda')\,|U| \ge (1 - \lambda') \exp\bigl( n I(P; W) - O(\sqrt{n}) \bigr), \]

and we are done. ■

Lemma 22: For 0 < τ, λ < 1 there is a constant K' and
δ > 0 such that for every discrete memoryless quantum
channel W and distribution P on X the following holds:
if A ⊂ X^n is such that P^n(A) ≥ τ then there exists an
(n, λ)-transmission code (E, D) with the properties



\[ \forall m \in M: \quad E(m) \in A \quad \text{and} \quad \operatorname{Tr} D_m \le \operatorname{Tr} \Pi^n_{W,\delta}(E(m)), \]



\[ |M| \ge \exp\bigl( n I(P; W) - K' \sqrt{n} \bigr) . \]

Proof. See >!, thm. II.4.



Progress on the problem of achievability of this bound
is not known to us. It is remarkable that Koashi and
Imoto could obtain the exact optimal bound in the
case of blind coding. It is indirectly defined via a canonical
joint decomposition of the source states, but it can be derived
from their result that generically the optimum rate is
H(PW), which is achieved by simply Schumacher encoding
the ensemble {P(x), W_x}.
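
To illustrate the gap between the visible-coding lower bound I(P; W) and the rate H(PW) of Schumacher coding the ensemble, the sketch below evaluates both for two small qubit ensembles. It reads I(P; W) as the Holevo quantity H(PW) − Σ_x P(x) H(W_x) with H the von Neumann entropy; the helper names and the example states are our own assumptions, and entropies are in bits.

```python
import math

def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 Hermitian matrix (closed form)."""
    a = complex(m[0][0]).real
    d = complex(m[1][1]).real
    b = complex(m[0][1])
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + abs(b) ** 2)
    return [mean - radius, mean + radius]

def von_neumann_entropy(rho):
    """H(rho) in bits; eigenvalues at or near zero contribute nothing."""
    return -sum(t * math.log2(t) for t in eigenvalues_2x2(rho) if t > 1e-12)

def average_state(probs, states):
    """The average state PW = sum_x P(x) W_x."""
    return [[sum(p * s[i][j] for p, s in zip(probs, states)) for j in range(2)]
            for i in range(2)]

def holevo_information(probs, states):
    """I(P; W) = H(PW) - sum_x P(x) H(W_x)."""
    mixed = average_state(probs, states)
    return von_neumann_entropy(mixed) - sum(
        p * von_neumann_entropy(s) for p, s in zip(probs, states))

probs = [0.5, 0.5]
# Hypothetical mixed-state ensemble: both states have eigenvalues 0.9 and 0.1.
mixed_states = [[[0.9, 0.0], [0.0, 0.1]],
                [[0.5, 0.4], [0.4, 0.5]]]
# Pure-state ensemble |0><0|, |+><+|: here the two quantities coincide.
pure_states = [[[1.0, 0.0], [0.0, 0.0]],
               [[0.5, 0.5], [0.5, 0.5]]]

for states in (mixed_states, pure_states):
    entropy = von_neumann_entropy(average_state(probs, states))
    print(holevo_information(probs, states), entropy)
```

For the mixed-state ensemble the Holevo quantity is strictly smaller than H(PW), which is the room that visible coding might exploit.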

Nevertheless, the results obtained in the classical case 
are very encouraging, so we state two conjectures: 

Conjecture 23: For 0 < λ < 1 there exist (n, λ)-codes
with common randomness, asymptotically achieving transmission
rate I(P; W) and common randomness consumption
H(W|P).



If it turns out true, and also question ^ has a positive
answer, we might even hope that also

Question 24: For 0 < λ < 1, is

\[ \lim_{n \to \infty} \frac{1}{n} \log M(n, \lambda) = I(P; W) \ ? \]

[Note that, as in the case of question ^, codes achieving the
optimal bound may also be constructed to satisfy eq. (17).]

answers "yes".

The implications of these statements, if they are true,
would be of great significance to quantum information theory:
not only would we get a new proof of the capacity of a
classical-quantum channel being bounded by the maximum
of the Holevo information and of the optimality of common
randomness extraction from a class of bipartite quantum
sources, but also the achievability of I(P; W) in the
quantum rate distortion problem p3[ | with visible coding
would follow, which until now has escaped all attempts.

VIII. Concluding remarks 

We demonstrated the current state of knowledge in the 
problem of visible compression of sources of probability dis- 
tributions and its extension to mixed state sources in quan- 
tum information theory. Apart from reviewing the cur- 
rently known constructions we contributed a better under- 
standing of the resources involved: in particular the use of 
common randomness in some of them, and providing strong 
converses. Also we showed the numerous applications the 
result (and sometimes the conjectures) have throughout in- 
formation theory, making the matter an eminent unifying 
building block within the theory. 

We would like to draw the attention of the reader once 
more to our questions ^ and and especially the conjec- 
ture ^ offering them as a challenge to continue this work. 